Principal Component Analysis: Boston Housing Data

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass.

There are 13 attributes in each case of the dataset. They are:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population

The target variable is the median value (price) of owner-occupied homes.

Analyzing the Data using Data Prep

Thus we see that there are many variables which are correlated to each other.

The issue here is that there is a lot of collinearity between our predictor variables. For example, DIS is highly correlated with INDUS, NOX and AGE, and LSTAT is highly correlated with RM.

Such multicollinearity can make the coefficient estimates of any linear regression model unstable.
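As a rough illustration of how such collinear pairs can be flagged, here is a minimal sketch. It uses synthetic stand-in columns (`x1`, `x2`, `x3` are hypothetical, not the Boston variables) and an assumed cut-off of 0.8 for "highly correlated".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins, NOT the Boston data: x2 is built to track x1 closely.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Pairwise Pearson correlations between the columns.
corr = np.corrcoef(X, rowvar=False)

# Assumed threshold for calling a pair "highly correlated".
threshold = 0.8
high_pairs = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

# A large condition number of the correlation matrix is one common
# symptom of multicollinearity.
cond = float(np.linalg.cond(corr))

print(high_pairs)
print(cond)
```

Only the pair that was constructed to be collinear is reported, and the condition number is large because the correlation matrix is close to singular.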

Apart from the target and RM, none of the variables is normally distributed. This might be an issue, as many statistical procedures assume that the data is normally distributed.

Normalizing the data

Thus we see that scaling the variables with StandardScaler has shifted the mean of every variable's distribution to 0 and rescaled each variable to unit variance.
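The effect of this standardization can be sketched with plain NumPy. The data below is synthetic and the column scales are made up for illustration. scikit-learn's StandardScaler performs the same per-column operation, using the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic columns on very different scales (illustrative values only).
X = rng.normal(loc=[5.0, 300.0, 0.5], scale=[2.0, 50.0, 0.1], size=(200, 3))

# z-score each column: subtract its mean, divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```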

Splitting the Data into Train and Test
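A random split of this kind might look like the sketch below. The 80/20 ratio, the seed and the synthetic arrays are assumptions made for illustration; the split actually used in this notebook is not recorded here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features and target standing in for the real data.
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

test_fraction = 0.2  # assumed ratio, not taken from the notebook

# Shuffle the row indices once, then cut them into two disjoint sets.
idx = rng.permutation(len(X))
n_test = int(round(test_fraction * len(X)))
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```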

Performing PCA

From the above we see that 6 components are sufficient, as they explain 85% of the variation in the data.
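The way a component count is read off the cumulative explained variance can be sketched as follows. This is a NumPy-only illustration on synthetic data, with PCA computed through the SVD of the standardized matrix. The 85% target is taken from the text, while the data, and therefore the resulting number of components, are not those of the Boston dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 8

# Synthetic data driven by 3 hidden factors plus noise (illustrative only).
latent = rng.normal(size=(n, 3))
mixing = rng.normal(size=(3, p))
X = latent @ mixing + 0.3 * rng.normal(size=(n, p))

# Standardize the columns, then take the SVD of the standardized matrix.
X = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Squared singular values are proportional to the variance of each component.
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 85%.
target = 0.85
k = int(np.searchsorted(cumulative, target) + 1)

print(np.round(cumulative, 3))
print(k)
```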

Fitting the PCA with 6 Components

Biplot between PCA 1 and PCA 2

The above plot shows the biplot between Principal Component 1 and Principal Component 2:

  1. PCA1 has large positive loadings on NOX, INDUS, RAD, TAX and CRIM, variables which are highly correlated with each other. Thus it basically measures factors outside the house: factors which affect the locality of the house and not the house as such.

  2. PCA2 has high positive loadings on CHAS, the target and RM. Thus it might be capturing the characteristics of the house in general and whether it is near the Charles River or not.

  3. Further, we can also see that there are two clusters of houses in the area.

  4. We can also see a few outliers in the top right corner and bottom middle.
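The reading of loadings described in the list above can also be done numerically, by ranking the variables by the absolute size of their loading on each component. The sketch below is a hedged illustration on synthetic data: the variable names `a1` to `b2` and the helper `top_variables` are hypothetical, and the sign of a loading is arbitrary, which is why absolute values are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Two hidden factors; three variables follow the first, two follow the second.
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
names = ["a1", "a2", "a3", "b1", "b2"]  # hypothetical variable names
X = np.column_stack([
    f1 + 0.2 * rng.normal(size=n),
    f1 + 0.2 * rng.normal(size=n),
    f1 + 0.2 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
])

# Standardize, then obtain the loadings as the rows of Vt.
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)


def top_variables(component, how_many):
    """Return the names with the largest absolute loadings on one component."""
    order = np.argsort(-np.abs(component))[:how_many]
    return [names[i] for i in order]


pc1_top = top_variables(Vt[0], 3)
pc2_top = top_variables(Vt[1], 2)

print(pc1_top)
print(pc2_top)
```

Variables that share a hidden factor end up loading on the same component, which is the pattern a biplot shows as arrows pointing in a similar direction.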

Biplot between PCA 2 and PCA 3

Component Loadings are Orthonormal (Orthogonal + Unit Vectors)

Thus the loadings are orthogonal.

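The correlation matrix of the loadings equals the identity matrix only when the loadings are centered, because the dot product of two vectors equals their correlation only when the vectors are centered and scaled to unit length. The loadings already have unit length, and there is no reason to center them, so it is the dot product that we check here.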

Thus from the above we see that the dot product between the loadings is 0 and there is no correlation between the various principal components. This resolves the problem of multicollinearity in our data that we saw earlier.

Thus the PCA Loadings are Orthogonal
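A check of this kind can be sketched as below, again with NumPy only and synthetic data. The loadings are taken as rows of `Vt` from the SVD, and `k = 4` is an arbitrary choice for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic correlated data, standardized column by column.
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 4                 # arbitrary number of components for this sketch
loadings = Vt[:k]     # one row per component

# Matrix of dot products between loadings: identity if they are orthonormal.
gram = loadings @ loadings.T
is_orthonormal = bool(np.allclose(gram, np.eye(k)))

# np.corrcoef centers each row first, so this is generally NOT the identity.
loading_corr = np.corrcoef(loadings)

print(is_orthonormal)
print(np.round(gram, 6))
```

The diagonal of `gram` holds the squared lengths of the loadings (all 1), and the off-diagonal entries hold the dot products between different loadings (all 0).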

Component Scores are Orthogonal

Thus from above we see that Component Scores are also Orthogonal
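The same kind of check for the scores can be sketched as follows. The data is synthetic and `k = 3` is arbitrary. Because the input columns are centered, the score columns are centered too, so their dot products being zero also means they are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic correlated data, standardized column by column.
X = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

_, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 3                      # arbitrary number of components for this sketch
scores = X @ Vt[:k].T      # project the data onto the first k loadings

# Dot products between score columns: only the diagonal is non-zero.
gram = scores.T @ scores
off_diagonal = gram - np.diag(np.diag(gram))

# Correlation matrix of the scores: the identity, up to rounding.
score_corr = np.corrcoef(scores, rowvar=False)

print(np.round(off_diagonal, 8))
print(np.round(score_corr, 8))
```

Unlike the loadings, the scores are orthogonal but not unit length: the diagonal of `gram` equals the squared singular values.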

Test Validation of PCA

Thus we see that the R-squared value is 65%, which is not great but not bad either.
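For reference, a test-set R-squared for a regression on principal components can be computed along the lines of the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, the model is ordinary least squares on the component scores, and the scaling and PCA are fitted on the training rows only, which may differ from the order of steps used above. The value it prints is unrelated to the 65% reported here.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k = 400, 6, 3

# Synthetic data: features and target both depend on 3 hidden factors.
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.2 * rng.normal(size=(n, p))
y = latent @ np.array([1.5, -2.0, 0.5]) + 0.5 * rng.normal(size=n)

# Random 80/20 split of the row indices.
idx = rng.permutation(n)
train, test = idx[:320], idx[320:]

# Fit the scaling on the training rows, apply it to both sets.
mu, sigma = X[train].mean(axis=0), X[train].std(axis=0)
Z_train = (X[train] - mu) / sigma
Z_test = (X[test] - mu) / sigma

# Fit PCA on the training rows, project both sets onto the first k loadings.
_, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
W = Vt[:k].T
T_train, T_test = Z_train @ W, Z_test @ W

# Least-squares regression of the target on the scores, with an intercept.
A_train = np.column_stack([np.ones(len(T_train)), T_train])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
y_pred = np.column_stack([np.ones(len(T_test)), T_test]) @ coef

# R-squared on the held-out rows.
ss_res = np.sum((y[test] - y_pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2 = float(1.0 - ss_res / ss_tot)

print(round(r2, 3))
```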